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ABSTRACT 

The aim of this paper is to present a first evaluation of the 
potential of an asynchronous distributed computation asso- 
ciated to the recently proposed approach, D-iteration: the 
D-iteration is a fluid diffusion based iterative method, which 
has the advantage of being natively distributive. It exploits 
a simple intuitive decomposition of the matrix-vector prod- 
uct as elementary operations of fluid diffusion associated to a 
new algebraic representation. We show through experiments 
on real datasets how much this approach can improve the 
computation efficiency when the parallelism is applied: with 
the proposed solution, when the computation is distributed 
over K virtual machines (PIDs), the memory size to be han- 
dled by each virtual machine decreases linearly with K and 
the computation speed increases almost linearly with K with 
a slope becoming closer to one when the number N of linear 
equations to be solved increases. 

Categories and Subject Descriptors 

G.l.O [Mathematics of Computing]: Numerical Anal- 
ysis — Parallel algorithms-, G.1.3 [Mathematics of Com- 
puting]: Numerical Analysis — Numerical Linear Algebra-, 
G.2.4 [Computer Systems Organization]: Gomputer- 
Gommunication Networks — Distributed Systems 

General Terms 

Algorithms, Performance 

Keywords 

Distributed computation. Iteration, Fixed point. Eigenvec- 
tor. 

1. INTRODUCTION 

Surprisingly enough, there was a recent result [S] showing 
a potentially significant reduction of the numerical compu- 
tation cost in a very classical problem of the calculation of 
the eigenvector of a large sparse matrix, based on a new 
representation/interpretation/decomposition of the matrix- 
vector product as elementary operations of fluid diffusion 
(cf. IH1[T]). This is an alternative solution to existing itera- 
tive methods (cf. [51 1161 1^1: its potential in the context of 
PageRank equation has been shown in [9] and the applica- 
tion of the approach in a general context is described in [S] 
(D-iteration). 



The complexity of the computation of the eigenvector of a 
matrix is a very well known problem and it increases rapidly 
with the dimension of the vector space. Efficient, accurate 
methods to compute eigenvectors of arbitrary matrices are 
in general a difficult problem (cf. power iteration | 17| . QR 
algorithm EldS]). 

Starting from the algorithm proposed in [9] (a computa- 
tion speed-up by factor 4-40 depending on the graphs was 
observed), we present here a first evaluation of the speed- 
up factor that can be cumulatively applied on the previous 
performance when a distributive architecture is used. There 
have been a lot of researches concerning the distributed com- 
putation of the linear equations (cf. [3l ll0lfT5lll21lll |l. with 
a particular interest on asynchronous iteration scheme. The 
elimination of the synchronization constraints is important 
on heterogeneous platforms for a better efficiency and for 
an effective scalability to distributed platforms. The dis- 
tributive algorithm defined and evaluated here is directly 
inspired from the first asynchronous distributed scheme that 
has been described in [?]• 

We recall that the PageRank equation can be written un- 
der the form: 

A = P'X = P.X + B (1) 

where P and P' are non-negative square matrices of size N x 
N {P' is the stochastic matrix associated to the transition 
probability), B a non-negative vector of size N and X the 
eigenvector of P' {xi gives the score of the importance of 
the page i). 

We recall that the D-iteration approach works when the 
spectral radius of P is strictly less than 1 (so that the power 
series X^n>o is convergent). In this paper, we will focus 
only on the specific case when P is strictly contractive: for 
the PageRank equation, the damping factor d is the explicit 
contraction coefficient (the usually considered value of d is 
0.85). 

The main objective of this paper is to evaluate through 
simple simulations (based on a single PG for now) the effi- 
ciency of the D-iteration based simple distributed computa- 
tion architecture, when applied on a large matrix (current 
simulations limited to N = 10®) in the context of PageRank 
type equation. 

In Section [51 we define the distributed algorithm that is 
used. The evaluation of the computation efficiency in the 
context of the PageRank equation is studied in Section [5] 

2. DISTRIBUTED AUGORITHM 



The fluid diffusion model in the general case is described 
by the matrix P associated with a weighted graph {pij is 
the weight of the edge from j to i, positive or negative) and 
the initial condition Fo — B (cf. [8]). 

We recall the definition of the two vectors used in the 
D-iteration method: the fluid vector Fn is defined by: 

Fn = {F — Jin F PJin)Bn — l- (2) 

where: 

• Id is the identity matrix; 

• I — with in € A^} is a deter- 

ministic or random sequence such that the number of 
occurrence of each value k £ {1, in I is infinity; 

• Jk a, matrix with all entries equal to zero except for 
the fc-th diagonal term: {Jk)kk = 1- 

And the history vector Hn defined by [Hq initialized to a 
null vector): 

n 

Hn = (3) 

fe=i 

The distance to the limit X (Li norm \X — Hn\) is then 
bounded by r/(l — d), where r is the residual fluid r = IT^I 
and where d is the contraction factor of P (equal to the 
damping factor for PageRank equation). Note that Hn is 
an increasing function for each component (when P and B 
are non-negative), so that: \X — Hn\ = \^i ~ {Hn)i\ = 

- {Hn)i) Therefore, \X - Hn\ = 1 - \Hn\ when X 
is a probability vector. 

2.1 Pseudo-code 

We recall that the above equations m and are the 
mathematical formulations of the following algorithm (D- 
iteration) : 

Initialization: 

H[i] := 0; 

F[i] := B_i; 
r := IFI; 

Iteration: 
k := 1; 

While ( r/(l-d) > Target_Error ) 

Choose i_k; 
sent : = F [i_k] ; 

H[i_k] += sent; 

F[i_k] := 0; 

If ( i_k has at least one child ) 

For all child node j of i_k: 

F[j] += sent * p(j,i_k); 
r := IFI; 
k++ ; 

Remark 1. Of course, we don’t have to compute the quan- 
tity r in each step of iteration. 

2.2 PageRank equation 

We recall that in the context of PageRank equation, P 
is of the form: dQ where Q is a stochastic matrix (or sub- 
stochastic matrix in presence of dangling nodes) and satisfies 
the equation: 



with V a personalization vector (by default, equal to the uni- 
form probability vector {1/N, .., 1/N)*) and d is the damping 
factor. Equivalently, we can write for each entry of X: 

Xi = +dJ2qijXj. (5) 

j 

And we have, qij = , where ffoutj is the number of 

outgoing links from j (when there is at least one outgoing 
link from j, otherwise, one may complete the j-th column 
of Q by 1/N cf. [HIIE]). 

2.3 Partition sets 

In the following, we consider two simple K partitioning 
sets: 

• Uniform partition: fli = {1,2, N/ K}, fl 2 = {N/K-I 
1,2, ...,2 X N/K}, etc 

• Cost Balanced (CB) partition: Ilk = {oJk,oJk+l, ■■■,0Jk+i — 
1} such that '^//Fff~^{ifoutn) = L/K, where L is the 
total number of non-null entries of Q (or the total num- 
ber of links of the graph associated to Q), 

such that {1, ..., N} = 1} — UkUk- The intuition of the cost 
balanced partition is that when we apply the diffusion iter- 
ation on all nodes of each Qk, the diffusion cost is constant. 
The main reason why we chose this is the simplicity of its 
computation: as a possibly better alternative solution, we 
could equalize the quantity: (which gives a 

very rough information on at which speed the residual fluid 
- cf. Section EH- of the set Ilk disappears). 

As described in (3, each PIDk computes {Fn)i and {Hn)i 
for i £ Uk ([E„]fe and [Hn]k) and those informations are 
exchanged between PIDs. In the following, we assumed 
that there is no communication cost when information is 
exchanged between PIDs. 

Remark 2. We believe that there is here some nice sim- 
ple adaptation scheme to be designed making the set Ok dy- 
namically evolve in time, with the idea that an idle PID could 
take nodes from the busiest PID: such an operation is indeed 
quite easy with our diffusion model: we just need to transmit 
the information Fn and Hn of the candidate nodes (then, 
possibly re-forwarding fluid to the new destination PID, if 
not well synchronized etc). 

2.4 PID modelling 

We consider a time stepped approximation for the simu- 
lation of the distributed computation cost (for now running 
on a single PC): during each time step, each PID can execute 
PID^Speedk operations. By default, we set: PI D^Speedk = 
PID_Speed = L/K (PIDs are assumed to compute at the 
same speed). 

We define an elementary operation cost as a diffusion cost 
from one node to another node. 

In the following, the number of iterations is defined as the 
normalized elementary operation cost dividing the number 
of operations by L (so that it can be easily compared to the 
cost of one matrix- vector product, or one iteration in power 
iteration). 

Each PIDk applies the diffusion from nodes in Ok', the 
diffusion iteration is done until the residual fluid 

rk = |[E„]fc| = {Fn)i 



A = dQX -f (1 - d)V, 



( 4 ) 



of Qk is below a certain threshold: 

Vk < T. 

When this target is reached, this PIDk can go in sleep mode. 

In each Q.k, the choice of the node sequence Ik is done 
based on the ideas proposed in [SE]: 

4*’^ = a.Tgmax {{Fn-i)i/ + 1) x (#outi + 1))) 

with (resp. ^outi) equal to the number of incoming 

(resp. outgoing) links to (resp. from) the node i. To ap- 
proximate the computation of the arg max (the computation 
of the exact value is computation costly and not necessary), 
we used a cyclic test on nodes and all i such that {Fn)i is 
above a threshold Tk are chosen, then we scale down the 
threshold progressively by a factor a > 1: 

T_k \= alpha. 

We observed that the results below are not sensitive to the 
choice of a (between 1.2 and 2, by default we chose 1.5). 

2.5 Fluid transmission to external nodes 

By external nodes, we mean here all nodes j ^ Qk such 
that PIDk has {Fn)j waiting for the transmission to a PIDk' 
U & ^k')- Each PID monitors its residual fluid rk and the 
fluid amount to be sent s*,. The quantity Sk can be com- 
puted in two ways: 

• continuously from regular updates when diffusions are 
applied in 12*; 

• periodically, by the computation of the increment of H 
(from the last time the fluids have been sent to external 
nodes) and applying diffusion on this increment (cf. 

IS)- 

This last procedure is preferred, because in this way, we 
can save the computation cost, by applying (note that the 
quantity r* need not to be updated in every computation 
step as well) the following algorithm (*): 

Choose i (in Omega_k) ; 

sent : = F [i] ; 

F[i] := 0; 

H[i] += sent; 

If ( i has at least one child in 0mega_k ) 

For all child node j of i (in 0mega_k) : 

F[j] += sent * p(j ,i) ; 

k++ ; 

At the time fluids are sent to external nodes we store [Hn]k 
in H_old, then for the next external diffusion, we can apply 
{P{H — Hoid))i^o^^ to find the fluid amount to be sent to 
each external node i. 

The transmission of fluid F„ from PIDk is done when the 
total fluid to be transmitted is above certain threshold: by 
default, we used the condition: 

Sk > rk/K. (6) 

In fact, we can even further optimize the information ex- 
change on {Fn)j^nk- this can be done directly by the re- 
ceivers and the sender PIDk can only send [Hn]k ~ 

• each PIDk computes and [Hn]k based on (*); 



N 


L (nb links) 


L/N 


D (Nb dangling nodes) 


1000 


12.935 


12.9 


41 (4.1%) 


10000 


125.439 


12.5 


80 (0.8%) 


100000 


3.141.476 


31.4 


2729 (2.7%) 


1000000 


41.247.159 


41.2 


45766 (4.6%) 



Table 1: Extracted graph: N = 1000 to 1000000. 



• when the condition (such as) ((6| is satisfied, PIDk 
sends [Hn]k to all other PIDs (we could also centralize 
the reception of this to a dedicated PID). 

The drawback of this approach is that the receiver PIDy 
needs to compute the quantity {P{[Hn\k — [Hoid]k )) ien. / 
{[Hoid\k is the lastly received \H]k from PIDk), which means 
that a vector of size N needs to be stored and instead of stor- 
ing only the column vectors of P corresponding to the set 
^k- (P)ienpenfc, we have to store (cf. [7]). 

2.6 Fluid reception from external nodes 

When new fluids are received from other PIDs, we need 
two operations: 

• add them to existing fluids (node per node; can be 
delayed) ; 

• update the threshold Tk, rescaling up by a factor: in 
our experiments, we applied a factor proportional to 
mm{{rk+received)/rk,received) (if rj, > 0, otherwise, 

Tk is set to received). 

3. EVALUATION OF THE DISTRIBUTED 
COMPUTATION COST 

The aim of this section is to show the potential of the 
distributed computation gain when the D-iteration is used. 

For the evaluation purpose, we experimented the D-iteration 
on a web graph imported from the dataset uk-2007-05@1000000 
(available on http : //law. dsi .unimi . it/datasets .php) which 
has 41.247.159 links on 1.000.000 nodes (45.766 dangling 
nodes). 

Below we vary N from 1000 to 1000000, extracting from 
the dataset the information on the first N nodes. 

3.1 Uniform partition 

We first analyse the results obtained with a uniform par- 
tition. Figure [T] shows a typical result with if = 2: we com- 
pare the evolution of the cost in terms of the number of nor- 
malized iteration for a targeted error (an upper bound of the 
distance to the limit: the y-axis shows the value rfc/(l — d) 
for each PID). In this case, PIDl computes the first half 
nodes and the last half nodes is handled by PID2. Because 
of the possible idle states of PIDs, at the end of each test, 
the computation cost of different PIDs are not necessarily 
equal. And naturally the global speed of the convergence 
is dictated by the slowest one. We observe that the av- 
erage cost of PIDl and PID2 for a given error (horizontal 
line) is roughly half of the cost with a single PID. When 
N = 100000, we see that the work load is not well balanced: 
the slopes are different almost by factor 2. 

Figure [2] shows the same result with 4 PIDs. Of course, 
when partitioned uniformly there is no reason that the slope 
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Figure 1: = 1000 and N = 100000: 1 PID vs 2 PIDs. 




0 5 10 15 20 25 30 

Nb of iterations 



Figure 2: N = 1000: 1 PID vs 4 PIDs. 



of each PID is the same: in this case, the convergence speeds 
of the different PIDs are very different. 

Note that because of the computation cost reduction ap- 
plied for the external diffusion, each PID’s slope is in fact 
closely linked to the spectral radius of the extracted square 
matrix corresponding to the nodes of the set flk (similar to 
d which plays a role for a single PID case). 

Hence, to optimize the distributed computation efficiency, 
there are two aspects: 

• first, the slope of the convergence should be equalized 
between PIDs; 

• second, there should be no idle PIDs. 

If we don’t consider the dynamic evolution of the partition 
sets, the above second point can not be controlled. However, 
the first point can be approximately optimized with the CB 
partition (cf. Section f3. 2 II . Our current understanding and 
guess is that this CB partition is not too bad for a very large 
matrix. This will be illustrated below. 

Figure |3] is a visualization of the residual fluid Vk and the 
cumulated fluid for transmission Sk'- such a view is impor- 
tant to check the stability of the algorithm (for instance, we 
should not have Sk too far above Vk). 




Figure S: N = 1000: 4 PIDs: Tk and Sk- 

Now, we evaluate the distributed computation gain for 
different values of N and K: here the convergence time is 
arbitrarily defined as the normalized number of iterations 
(given by the slowest PID) necessary to reach a target error 
of 1/A^. The results are summarized in Figure |4] when K is 
too large, there is less gain (for small N). This was expected, 
because if the sizes of Qk are too small, there is no possible 
improvement. Also when N/K is too small, the information 
exchange cost between PIDs becomes higher (which is not 
taken into account here). However, we see clearly that when 
N is large, the value of K on which the gain stays linear is 
larger. And this is the property that makes our approach 
promising. From this figure, we can possibly expect to build 
a real implementation on hundreds of virtual machines that 
could deliver a gain over a factor of 100 for say N > 10® (in 
the figure, it is reached for N = 100000, K = 1024). 

3.2 CB partition 

The illustration of the impact of CB partition is shown in 
Figure[5](|Hi I = 68671, IH 2 I = 31329): this is the typical im- 
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Figure 4: Computation speed-up factor with uni- 
form partition. 
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Figure 6 : N — 100000: impact of CB partition. 



pact that was generally observed: we see that the slopes of 
PIDs with CB partition are much closer than before. This 
is what we expected, but this is of course not always the 
case (when N/K is not large enough). Figure [6] (the respec- 
tive values of ujk are 0, 42110, 68671, 86399, 100000) shows 
another case where the slopes are made closer. 
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Figure 5 : N = 100000: impact of CB partition. 

As before, the distributed computation gain for different 
values of N and K are summarized in Figure [T] we see 
that generally the uniform partition case is improved: Figure 
[8] shows the gain in percentage when CB partition is used 
instead of uniform partition: the gain depends very much on 
the initial graph configuration (we may have naturally well 
balanced matrix or the opposite): we observed here up to 
250% gain. When K is too large, CB becomes meaningless. 

Remark 3. In Figure [3 the missing points are due to 
the impossibility (can not get K > N ) or the limitation of 
the memory size on a single PC (for N = 1000000, K up to 
256). 



3.3 Information exchange cost 

In this section, we want to evaluate the impact of the 
delayed information exchange (for instance, due to network 
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Figure 7: Computation speed-up factor with K: CB 
partition. 
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Figure 8: Speed-up gain CB/Unif. 



delays): for that purpose, in the simulation scheme above 
we added a parameter Delay_Proba such that when each 
PID decides to exchange information the transmission is 
tested every PID^Speed steps and delayed with probability 
Delay_Proba: the transmission is with 0 delay with proba- 
bility 1 — Delay J^roba and the average transmission delay 
is 



PI D^Speed x Delay ^Proba/ {1 — Delay .Proba). 

We recall that the convergence time was defined for the 
target error value of 1/N. For the required number of iter- 
ations, the normalized operation cost includes here the ele- 
mentary operation cost and the idle time (when PID^Speed 
is not totally consumed before entering in the idle state, the 
remaining capacity is counted as idle time). 
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Figure 9: Uniform partition, N = 1000: impact of 
delayed information exchange. 
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Figure 10: Uniform partition, N = 100000: impact of 
delayed information exchange. 



The results are shown in Figure [9] and Figure [TO] for N = 
1000 and N = 100000 when uniform partition is applied: 
we observe that the impact is very limited except when 
Delay J^roba is equal to 0.8. 

Figure [11] and Figure [12] for N = 1000 and N = 100000 
show the impact when CB partition is applied: we observe 



again a very limited impact when Delay-Proba is not too 
large. 




Nb of PIDs 



Figure 11: CB partition, N = 1000: impact of de- 
layed information exchange. 
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Figure 12: CB partition, N = 100000: impact of de- 
layed information exchange. 



Figure fT51 and Figure [TJ show the proportion of the idle 
states observed in these simulations. The proportion of the 
idle state is naturally defined as: 

idleJtimes / {elementary jx)st + idleJtimes). 

As expected, we observe that when K increases, the pro- 
portion of the idle state is increased (this is very clear for 
N = 100000 and IF > 10). 

The above results clearly shows that the impact of the 
delayed information is very limited when the main compu- 
tation cost is in the PIDs internal calculation cost (and the 
tolerance is quite large) . This means that thanks to our dis- 
tributed computation scheme that is really asynchronous, 
we can obtain a substantial gain with K PIDs as soon as 
the diffusion computation time on D*, is not significantly 
below the information exchange time. 

3.4 A very simple test of the partition set adap- 
tation 
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igure 13: N = 1000: proportion of the idle state. 



We chose here a case when the uniform partition did not 
correctly balance PIDs’ work load: N = 100000 and K = 2 
. Figure [TS] gives an illustration of a very simple dynamic 
adaptation scheme: in this case, we dynamically adjusted 
the value UJ 2 (+/ — 10%), starting with u }2 = 50000 (from 
the uniform partition), when: 

• the ratio of is larger than 2; 

• the ratio of the number of iterations is larger than 1.2. 
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Figure 15: Example of adaptation scheme. 

Figure[l6]shows the evolution of the uj 2 adaptation. In this 
case, the simple adaptation scheme improved very slightly 
(about 4%) the CB partition based cost. 
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Figure 14: N = 100000: proportion of the idle state. 
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Figure 16: Evolution of uj 2 . 

As it was mentioned earlier, we have to control both the 
residual fluid (y-axis) and the computation cost (x-axis). Of 
course, we have to adapt such ideas to a practical implemen- 
tation architecture. And stabilizing such a solution may not 
be obvious in a general case. 



The results presented here are yet simulation based eval- 
uation of the computation efficiency of an asynchronous dis- 
tributed architecture based on the D-iteration method. We 
believe the model used here is realistic enough and shows a 
promising new solution to solve linear equations. 



4. CONCLUSION 

In this paper, we presented preliminary simulation results 
on the evaluation of a distributed solution associated to the 
D-iteration method. The main aim of this study was to 
estimate the potential of our approach before a real imple- 
mentation and benchmarking of such a solution. We showed 
that our algorithm is very well suited for the parallel com- 
putation of really large linear systems for at least three rea- 
sons: firstly, each entity (PID) needs to keep and update 
only the local information (distributed memory), the locally 
kept information size decreasing linearly with K the number 
of PIDs; secondly, the efficiency of the parallel computation 
increases with the size N and finally, the computation speed 
increases almost linearly with K when N /K \s large enough. 
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